Patrice Riemens on Tue, 24 Mar 2009 06:37:38 -0400 (EDT)
<nettime> Ippolita Collective: The Dark Face of Google, Chapter 4 (First part)
NB this book and translation are published under a Creative Commons 2.0
license (Attribution, Non Commercial, Share Alike).
Commercial distribution requires the authorisation of the copyright
holders:
Ippolita Collective and Feltrinelli Editore, Milano (.it)
Ippolita Collective
The Dark Side of Google (continued)
Chapter 4 Algorithms or Bust! (Part 1)
Google's mind-boggling rate of growth has not at all diminished its
reputation as a fast, efficient, exhaustive, and accurate search engine:
haven't we all heard the phrases "if it's not on Google, it doesn't
exist!" and "it's faster with Google!"? At the core of this success lies,
besides the elements we have discussed before, the PageRank[TM] algorithm
we mentioned in the introduction, which steers the forays of Google's
spider through the Net. Let's now look more closely at what it is and how
it works.
Algorithms and real life
An algorithm [*N1] is a method for solving a problem: a procedure built
up from a sequence of simple steps that lead to a certain desired result.
An algorithm that actually solves a problem is said to be accurate, and
if it does so speedily, it is also efficient. There are many different
types of algorithms, and they are used in the most diverse scientific
domains. Yet algorithms are not some kind of arcane procedure concerning,
and known to, only a handful of specialists; they are devices that
profoundly influence our daily lives, much more so than would appear at
first sight.
Take, for instance, the technique used to tape a television programme: it
is based on algorithms; so are the methods used to put a pile of papers
in order, or to sequence the stop-overs of a long journey. Within a given
time, by going through a number of simple, replicable steps, we make a
more or less implicit choice of algorithms that apply to the problem at
hand. 'Simple', in this regard, means foremost unequivocal: readily
understandable for whoever will put the algorithm to work. Seen in this
light, a kitchen recipe is an algorithm: "bring three litres of water to
the boil in a pan, add salt, throw in one pound of rice, cook for twelve
minutes and sieve, serve with a sauce to taste" is a step-by-step
description of a cooking process, provided the reader is able to
interpret correctly elements such as "add salt" and "serve with a sauce
to taste".
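By way of illustration only (this is our own sketch, not something taken
from the book), the recipe can be written out as exactly such a finite,
ordered sequence of unambiguous steps, here in Python:

    # A recipe expressed as an algorithm: a finite, ordered list of
    # unambiguous steps that anyone (or any machine) can carry out.
    recipe = [
        "bring three litres of water to the boil in a pan",
        "add salt",
        "throw in one pound of rice",
        "cook for twelve minutes",
        "sieve",
        "serve with a sauce to taste",
    ]

    def execute(steps):
        # Work through the steps one by one, in order; when the last
        # step is done, the algorithm terminates with the desired result.
        for number, step in enumerate(steps, start=1):
            print(f"step {number}: {step}")

    execute(recipe)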
Algorithms are not necessarily methods for obtaining completely exact
results. Some are intended to arrive at acceptable results {within a
given period of time} [French text: 'without concern for the time factor'
- which doesn't sound very logical to me -TR]; others arrive at results
through as few steps as possible; yet others focus on using as few
resources as feasible [*N2].
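To make these different design goals concrete, here is an illustration of
our own (not an example from the book): two algorithms that solve the
same problem, finding an item in a sorted list, where the second reaches
the result in far fewer steps than the first.

    # Two algorithms for the same problem: finding an item in a sorted
    # list. linear_search is simpler but may inspect every element;
    # binary_search reaches the result in far fewer steps.

    def linear_search(items, target):
        for position, item in enumerate(items):
            if item == target:
                return position
        return None

    def binary_search(items, target):
        low, high = 0, len(items) - 1
        while low <= high:
            middle = (low + high) // 2
            if items[middle] == target:
                return middle
            if items[middle] < target:
                low = middle + 1
            else:
                high = middle - 1
        return None

On a sorted list of a million entries, the first may need up to a million
comparisons, while the second needs at most about twenty.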
It should also be stressed, before going deeper into the matter, that
nature itself is full of algorithms. Algorithms really concern us all,
because they constitute concrete practices meant to achieve a given
objective. In the IT domain they are used to solve recurrent problems in
software programming, in designing networks, and in building hardware.
For a number of years now, due to the increasing importance of
network-based models for analysing and interpreting reality, many
researchers have focused their studies on the construction methods and
network trajectories of the data which are the 'viva materia' of
algorithms. The 'economy of search' John Battelle writes about [*N3] has
become possible thanks to the steady improvement of the algorithms used
for information retrieval, developed in order to augment the potential
for data discovery and sharing with an ever increasing degree of
efficiency, speed, accuracy, and security. The instance the general
public is most familiar with is the 'peer-to-peer' ('P2P') phenomenon:
instead of setting up humongous databases for accessing videos, sound,
texts, software, or any other kind of information in digital format, ever
more optimised algorithms are being developed all the time, facilitating
the creation of extremely decentralised networks through which any user
can make contact with any other user in order to engage in mutually
beneficial exchanges.
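A toy sketch, of our own devising and not modelled on any particular P2P
protocol, may help to picture this: each peer knows only its own files
and its immediate neighbours, and a query travels from peer to peer until
someone can answer it, without any central database.

    # A toy peer-to-peer lookup: no central database. Each peer knows only
    # its neighbours and the files it shares; a query is passed outwards
    # until a peer that holds the file answers.

    class Peer:
        def __init__(self, name, files):
            self.name = name
            self.files = set(files)
            self.neighbours = []

        def search(self, wanted, visited=None):
            visited = visited if visited is not None else set()
            visited.add(self.name)
            if wanted in self.files:
                return self.name                  # found locally
            for neighbour in self.neighbours:
                if neighbour.name not in visited:
                    hit = neighbour.search(wanted, visited)
                    if hit:
                        return hit                # found further away
            return None

    alice = Peer("alice", ["song.ogg"])
    bob = Peer("bob", [])
    carol = Peer("carol", ["film.avi"])
    bob.neighbours = [alice, carol]
    print(bob.search("film.avi"))    # -> "carol", without any central index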
The strategy of objectivity
The tremendous increase in the quantity and quality of bandwidth, and of
the memory in our computers, together with rapidly diminishing costs, has
enabled us to surf the Internet longer, better, and faster. Just twenty
years ago, modems with a connectivity of just a few hundred baud (the
number of 'symbols' transmitted per second) were the preserve of an
elite. Today, optical fibre criss-crosses Europe, carrying millions of
bytes per second, and is a technology accessible to all. Ten years ago, a
fair amount of technical knowledge was required to create digital
content. Today, the ease of publishing on the World Wide Web, the
omnipresence of e-mail, and the improvement of all kinds of online
collective writing systems, such as blogs, wikis, portals, mailing lists,
etc., together with the dwindling costs of registering Internet domains
and addresses, have profoundly changed the nature of users: from simple
consumers of information made available to them by IT specialists, they
have increasingly become creators of information themselves.
The increase in the quality of connectivity goes together with an
exponential increase in the quantity of data sent over the networks,
which, as we have pointed out earlier, entails the introduction of ever
better performing search instruments. This pressing necessity exerts a
deep attraction on social scientists, computer scientists, ergonomists,
designers, specialists in communication, and a host of other experts. On
the other hand, the 'informational tsunami' that hits the global networks
cannot be interpreted as a mere 'networkisation' of societies as we know
them, but must be seen as a complex phenomenon needing a completely fresh
approach. We therefore believe that such a theoretical endeavour cannot
be left to specialists alone, but demands a collective form of
elaboration.
If the production of DIY networks indeed constitutes an opportunity to
link autonomous realms together, we must also realise that the tools of
social control embedded in IT technologies represent a formidable
apparatus of repression.
The materialisation of this second scenario, most spectacularly
exemplified by the Echelon eavesdropping system [*N5], unfortunately
looks the most probable, given the steadily growing number of individuals
who are giving information away, as opposed to the ever diminishing
number of providers of search tools. Access to the information produced
by this steadily growing number of individuals is managed with an iron
hand by players who retain a monopoly over it while at the same time
reducing what is a tricky social issue to a mere marketing free-for-all,
a contest where the best algorithm wins.
A search algorithm is a technical tool that activates an extremely subtle
marketing mechanism: users trust that the search returns are not filtered
and that they correspond to choices made by the 'community' of surfers.
In short, a mechanism of trust in the objectivity of the technology
itself is triggered; the technology is recognised as 'good' because it is
free from the idiosyncratic influences and preferences typical of human
individuals. The 'good' machines, themselves the product of 'objective'
science and 'unbiased' research, will not tell lies, since they cannot
lie and in any case have no interest in doing so. Reality, however, is
very much at variance with this belief, which proves to be a demagogic
presumption - the cover for fabulous profits from marketing and control.
Google's case is the most blatant example of this technology-based
'strategy of objectivity'. Its 'good by definition' search engine keeps
continuous track of what its users are doing in order to 'profile' their
habits, and exploits this information by inserting personally targeted
and contextualised ads into all their activities (surfing, e-mailing,
file handling, etc.). 'Lite' ads, for sure, but all-pervasive, and even
able to generate feedback, so that users can, in the simplest way
possible, provide information to vendors and thus improve the 'commercial
suggestions' themselves by expressing choices. This continuous soliciting
of users, besides flattering them into thinking that they are
participants in some vast 'electronic democracy', is in fact the simplest
and most cost-effective way to obtain commercially valuable information
about the tastes of consumers. The users' preferences, and their
ignorance of the mechanism unleashed on them, are what constitute and
reinforce the hegemony of a search engine, since a much-visited site can
alter its content as a consequence of the outcome of these 'commercial
suggestions': a smart economic strategy indeed.
Seen from a purely computer science point of view, search engines perform
four tasks: retrieving data from the Web (the spider); stocking the
information in appropriate archives (databases); applying the correct
algorithm to order the data in accordance with the query; and finally,
presenting the results on an interface in a manner that satisfies the
user. The first three tasks each require a particular type of algorithm:
search & retrieval; memorisation & archiving; and query. The power of
Google, just like that of Yahoo! and the other search giants on the
network, is therefore based on the following elements (sketched in code
below):
1. A 'spider', that is, a piece of software that captures content on the
net;
2. An enormous capacity to stock data on secure carriers, and plenty of
backup facilities, to avoid any accidental loss of data;
3. An extremely fast system able to retrieve and order the returns of a
query according to the ranking of the pages;
4. An interface on the user's side to present the returns of the queries
requested (Google Desktop and Google Earth, however, are programmes the
user must install on her/his machine beforehand).
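Stripped to its bare bones, this four-task pipeline can be sketched as
follows; the three-page 'web', the function names and the trivial
word-count scoring are all invented for the illustration and have nothing
to do with Google's actual code.

    # A bare-bones sketch of the four tasks: crawl, store, rank, present.
    # The 'web' is a small dictionary standing in for real pages.

    FAKE_WEB = {
        "a.example": "google search engine",
        "b.example": "search engine algorithms and search quality",
        "c.example": "cooking recipes",
    }

    def crawl(urls):                      # task 1: the spider fetches pages
        return {url: FAKE_WEB[url] for url in urls}

    def store(pages, database):           # task 2: archive them in a database
        database.update(pages)

    def rank(database, query):            # task 3: order the matching pages
        matches = {url: text.count(query)
                   for url, text in database.items() if query in text}
        return sorted(matches, key=matches.get, reverse=True)

    def present(results):                 # task 4: show them to the user
        for position, url in enumerate(results, start=1):
            print(position, url)

    database = {}
    store(crawl(list(FAKE_WEB)), database)
    present(rank(database, "search"))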
Spiders, databases and searches
The spider is an application that is usually developed in the labs of the
search engine companies. Its task is to surf web pages from one link to
the next while collecting information such as document format, keywords,
page authors, outgoing links, etc. When done with its exploratory rounds,
the spider sends all this to the database for archiving. Additionally,
the spider must monitor any changes on the sites visited so as to be able
to programme its next visit and stock fresh data. The Google spider, for
instance, manages two types of site scan: a monthly, elaborate one, the
so-called 'deep crawl', and a daily one, the 'fresh crawl', for updating
purposes. This way, Google's databases are continuously updated by the
spider through its network surfing. After every 'deep crawl', Google
needed a few days to update the various indexes and to communicate the
new results to all its data centers. This lag time is known as the
"Google Dance": the search returns used to vary, since they stemmed from
different indexes. But from 2003 onwards Google has altered its
cataloguing and updating methods, and has also spread them out much more
over time, resulting in a much less pronounced 'dance': the search
results now vary in a dynamic and continuous fashion, and there are no
longer periodic 'shake-ups'. In fact, the search returns will even change
according to users' surfing behaviour, which is archived and used to
'improve', that is to 'simplify', the identification of the information
requested [*N6].
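The two-speed crawling policy just described can be pictured with a
minimal sketch; the intervals, field names and the 'changes_often' flag
are our own assumptions, not Google's actual parameters.

    # A sketch of the two-speed revisit policy: every page gets an
    # occasional thorough visit ('deep crawl'), and frequently changing
    # pages also get quick daily revisits ('fresh crawl').

    import time

    DAY = 24 * 3600
    DEEP_INTERVAL = 30 * DAY       # roughly monthly
    FRESH_INTERVAL = 1 * DAY       # daily

    def due_for_visit(page, now):
        if now - page["last_deep_crawl"] > DEEP_INTERVAL:
            return "deep crawl"    # full re-indexing of the page
        if page["changes_often"] and \
           now - page["last_fresh_crawl"] > FRESH_INTERVAL:
            return "fresh crawl"   # quick update of fresh content
        return None

    page = {"url": "news.example", "changes_often": True,
            "last_deep_crawl": 0, "last_fresh_crawl": 0}
    print(due_for_visit(page, time.time()))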
The list of choices the application works through in order to index a
site is what constitutes the true force of the Google algorithm. And
while the PageRank[TM] algorithm is patented by Stanford, and is
therefore public, the later alterations have not been publicly revealed
by Google, nor, for that matter, by any other search engine company
existing at the moment. Nor are the back-up and recovery methods used in
the data centers being made public.
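The published core of PageRank[TM] is simple enough to fit in a few
lines: a page's rank is the share of an imaginary 'random surfer's'
attention it receives through the links pointing at it, computed
iteratively. The sketch below implements that published formula on a toy
link graph; the damping factor of 0.85 comes from Brin and Page's
original paper, the graph and iteration count are invented, and none of
Google's later, secret refinements are modelled.

    # The published core of PageRank on a toy link graph: each page's
    # score is redistributed along its outgoing links, with a damping
    # factor modelling a surfer who sometimes jumps to a random page.

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {page: 1.0 / len(pages) for page in pages}
        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
            for page, outgoing in links.items():
                share = rank[page] / len(outgoing) if outgoing else 0.0
                for target in outgoing:
                    new_rank[target] += damping * share
            rank = new_rank
        return rank

    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(toy_web))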
Again, from a computer science point of view, a database is merely an
archive in digital format: in its simplest, and until now also its most
common form, it can be represented as one or more tables which are linked
together and which have input and output values: these are called
relational databases. A database, just like a classic archive, is
organised according to precise rules regarding the stocking, extraction
and continuous enhancement of the quality of the data themselves (think
of the recovery of damaged data, redundancy avoidance, continuous
updating of data acquisition procedures, etc.). IT specialists have been
studying the processes of introduction, quality improvement, and search
and retrieval within databases for decades now. To this end, they have
experimented with various approaches and computer languages
(hierarchical, network and relational approaches, object-oriented
programming, etc.). The building of a database is a crucial component of
the development of a complex information system such as Google's, as its
functionality is entirely dependent on it. In order to obtain swift
retrieval of data and, more generally, efficient management of the same,
it is essential to identify correctly what the exact purpose of the
database is (and, in the case of relational databases, the purpose of the
tables), which must be defined according to the domains and the relations
that link them together. Naturally, it also becomes necessary to allow
for approximations, something that is unavoidable when one switches from
natural, analog languages to digital data. The catch resides in the
secrecy of the methods: as is the case with all proprietary development
projects, as opposed to those which are open and free, it is very
difficult to find out which algorithms and programmes have been used.
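For illustration, here is a relational database in its simplest form,
built with Python's standard sqlite3 module: two linked tables, precise
rules for insertion, and a query that extracts data across the relation.
This is generic SQL, chosen only to make the principle tangible; it says
nothing about the actual systems used by Google.

    # A minimal relational database: two linked tables, rules for
    # insertion, and a query that extracts data across the relation.

    import sqlite3

    connection = sqlite3.connect(":memory:")
    connection.executescript("""
        CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
        CREATE TABLE keywords (
            page_id INTEGER REFERENCES pages(id),
            word    TEXT
        );
    """)
    connection.execute("INSERT INTO pages (id, url) VALUES (1, 'a.example')")
    connection.executemany(
        "INSERT INTO keywords (page_id, word) VALUES (?, ?)",
        [(1, "search"), (1, "engine")],
    )

    # Extraction: which pages are indexed under the word 'search'?
    rows = connection.execute("""
        SELECT pages.url FROM pages
        JOIN keywords ON keywords.page_id = pages.id
        WHERE keywords.word = 'search'
    """).fetchall()
    print(rows)      # -> [('a.example',)]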
Documents from research centers and universities allow a few glimpses of
information on proprietary projects, as far as it has been made public.
They contain some useful tidbits for understanding the structure of the
computers used and the way search engines manage data. Just to give an
idea of the computing power available today, one finds descriptions of
computers which are able to resolve an Internet address into the unique
bit sequence that serves to index it in the databases in 0.5
microseconds, while executing 9,000 spider 'crawls' at the same time.
These systems are able to memorise and analyse 50 million web pages a day
[*N7].
The last algorithmic element hiding behind Google's 'simple' facade is
the search system which, starting from a query by the user, is able to
find, order, rank and finally return the most pertinent results to the
interface.
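A toy version of such a search system is the inverted index: the archive
is scanned once, every word is mapped to the set of documents containing
it, and a query is then answered by intersecting a few small sets instead
of rereading the whole archive. The documents and names below are
invented for the example.

    # A toy inverted index: each word points to the set of documents that
    # contain it; answering a query means intersecting small sets rather
    # than rereading every document in the archive.

    from collections import defaultdict

    documents = {
        "a.example": "the dark side of google",
        "b.example": "google search algorithms",
        "c.example": "algorithms in everyday life",
    }

    index = defaultdict(set)
    for url, text in documents.items():
        for word in text.split():
            index[word].add(url)

    def search(query):
        sets = [index[word] for word in query.split()]
        return set.intersection(*sets) if sets else set()

    print(search("google algorithms"))    # -> {'b.example'}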
A number of labs and universities have by now decided to make their
research in this domain public, especially regarding the answers to
problems that have been found, the various methods used to optimise the
speed of access to the data, the complexity of the systems, and the most
interesting instances of parameter selection.
Search engines must indeed be able to provide almost instantaneously the
best possible results, while at the same time offering the widest range
of choice. Google would without doubt appear to be the most advanced
search engine of the moment: as we will see in detail in the next
chapter, these extraordinary results cannot but be the outcome of a very
'propitious' form of filtering...
For the time being, suffice it to say that the best solution resides in a
proper balance between computing power and the quality of the search
algorithm. You need truly extraordinary archival media and indexing
systems to find the information you are looking for when the mass of data
is measured in terabytes (1 TB = 1,000 gigabytes = 10^12 bytes), or even
in petabytes (1 PB = 1,000 TB [or 1024 TB, Wikipedia's funny... -TR]),
and you also need a remarkable ability both to determine where the
information sits in the gigantic archive and to retrieve it in the
shortest possible time.
As far as Google's computing capacities are concerned, the Web is full of
- not always verifiable or credible - myths and legends, especially since
the firm is not particularly talkative about its technological
infrastructure. Certain sources buzz about lakhs [see Chapter 1 ;-)] of
computers interconnected through thousands of gigantic 'clusters'
[running on appropriate GNU/Linux distros - French text unclear]; others
talk about mega-computers whose design comes straight out of SciFi
scenarios: humongous freeze-cooled silos where a forest of mechanical
arms moves thousands of hard disks at lightning speed. Both speculations
are just as plausible (or fanciful), and do not necessarily exclude each
other. In any case, it is obvious that the extraordinary flexibility of
Google's machines allows for exceptional performance, as long as the
system remains 'open' - open to continuous in-house improvements, that
is.
(to be continued)
--------------------------
Translated by Patrice Riemens
This translation project is supported and facilitated by:
The Center for Internet and Society, Bangalore
(http://cis-india.org)
The Tactical Technology Collective, Bangalore Office
(http://www.tacticaltech.org)
Visthar, Dodda Gubbi post, Kothanyur-Bangalore
(http://www.visthar.org)
# distributed via <nettime>: no commercial use without permission
# <nettime> is a moderated mailing list for net criticism,
# collaborative text filtering and cultural politics of the nets
# more info: http://mail.kein.org/mailman/listinfo/nettime-l
# archive: http://www.nettime.org contact: nettime@kein.org